Reinforcement learning Totally Explained
Inspired by related psychological theory, reinforcement learning is a sub-area of machine learning in computer science, concerned with how an agent ought to take actions in an environment so as to maximize some notion of long-term reward. Reinforcement learning algorithms attempt to find a policy that maps states of the world to the actions the agent ought to take in those states. In economics and game theory, reinforcement learning is considered a boundedly rational interpretation of how equilibrium may arise.
The environment is typically formulated as a finite-state Markov decision process (MDP), and reinforcement learning algorithms for this context are closely related to dynamic programming techniques. State transitions and rewards in the MDP are typically stochastic, but their probabilities are stationary over the course of the problem.
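To make the link to dynamic programming concrete, here is a minimal sketch of value iteration on a toy MDP. The two-state MDP, its action names, its rewards, and the discount factor `GAMMA` are all hypothetical and chosen only for illustration; note that value iteration assumes the transition and reward model is known, which reinforcement learning generally does not.

```python
# Hypothetical toy MDP (an assumption for illustration, not from the article).
# P[state][action] is a list of (probability, next_state, reward) triples.
P = {
    0: {"stay": [(1.0, 0, 0.0)], "go": [(0.8, 1, 1.0), (0.2, 0, 0.0)]},
    1: {"stay": [(1.0, 1, 2.0)], "go": [(1.0, 0, 0.0)]},
}
GAMMA = 0.9  # assumed discount factor


def q_value(V, s, a, gamma=GAMMA):
    """Expected reward plus discounted value of the successor state."""
    return sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a])


def value_iteration(tol=1e-10, gamma=GAMMA):
    """Repeat the Bellman optimality backup until the values stop changing."""
    V = {s: 0.0 for s in P}
    while True:
        delta = 0.0
        for s in P:
            best = max(q_value(V, s, a, gamma) for a in P[s])
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < tol:
            return V


def greedy_policy(V, gamma=GAMMA):
    """Map each state to the action with the highest one-step lookahead value."""
    return {s: max(P[s], key=lambda a: q_value(V, s, a, gamma)) for s in P}


V = value_iteration()
policy = greedy_policy(V)
```

In this toy problem, staying in state 1 forever is worth 2 / (1 - 0.9) = 20, so the greedy policy moves from state 0 to state 1 and then stays there.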
Reinforcement learning differs from the supervised learning problem in that correct input/output pairs are never presented, nor are sub-optimal actions explicitly corrected. Further, there's a focus on on-line performance, which involves finding a balance between exploration (of uncharted territory) and exploitation (of current knowledge). The exploration vs. exploitation trade-off in reinforcement learning has been mostly studied through the multi-armed bandit problem.
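The trade-off can be illustrated with a small multi-armed bandit simulation. The sketch below uses an epsilon-greedy rule, which is only one of several possible strategies; the function name `run_bandit`, the Gaussian reward noise, and the arm means in the usage line are assumptions made for this example.

```python
import random


def run_bandit(true_means, epsilon=0.1, steps=5000, seed=0):
    """Epsilon-greedy play on a bandit with Gaussian rewards (assumed noise model)."""
    rng = random.Random(seed)
    n_arms = len(true_means)
    counts = [0] * n_arms        # how often each arm was pulled
    estimates = [0.0] * n_arms   # running mean reward per arm
    total = 0.0
    for _ in range(steps):
        if rng.random() < epsilon:
            # Explore: pick an arm uniformly at random.
            arm = rng.randrange(n_arms)
        else:
            # Exploit: pick the arm with the best current estimate.
            arm = max(range(n_arms), key=lambda i: estimates[i])
        reward = rng.gauss(true_means[arm], 1.0)
        counts[arm] += 1
        # Incremental update of the sample mean for the pulled arm.
        estimates[arm] += (reward - estimates[arm]) / counts[arm]
        total += reward
    return estimates, counts, total / steps


estimates, counts, average_reward = run_bandit([0.2, 0.5, 1.5])
```

With `epsilon=0` the agent can lock onto whichever arm looked good first; with `epsilon=1` it never uses what it has learned. Intermediate values spend most pulls on the arm that currently looks best while still sampling the others.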
Formally, the basic reinforcement learning model consists of:
- a set of environment states;
- a set of actions; and
- a set of scalar "rewards" in the real numbers.
For a fixed policy, the value of a state is the expected discounted sum of future rewards, and it satisfies a recursive relation: it equals the expected immediate reward plus the discounted expected value of the successor state. By replacing those expectations with our current estimates, and performing gradient descent with a squared error cost function, we obtain the temporal difference learning algorithm TD(0). In the simplest case, the sets of states and actions are both discrete and we maintain tabular estimates for each state. Similar methods based on state-action pairs are Adaptive Heuristic Critic (AHC), SARSA and Q-Learning. All of these methods have extensions in which some approximating architecture is used, though in some cases convergence isn't guaranteed. The estimates are usually updated with some form of gradient descent, though there have been recent developments with least squares methods for the linear approximation case.
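As a rough illustration of the tabular case, the sketch below applies the TD(0) update to a fixed random-walk policy. The environment (a five-state corridor with a reward of 1 on exiting to the right), the step size `alpha`, and the function name `td0_random_walk` are assumptions made for this example and do not come from the article.

```python
import random


def td0_random_walk(episodes=20000, alpha=0.02, gamma=1.0, seed=0):
    """Tabular TD(0) evaluation of a random-walk policy on a 5-state corridor.

    States 1..5 are non-terminal; 0 and 6 are terminal. Exiting on the right
    pays reward 1, everything else pays 0 (a hypothetical test problem).
    """
    rng = random.Random(seed)
    n = 5
    V = [0.0] * (n + 2)  # terminal entries are never updated and stay at 0
    for _ in range(episodes):
        s = (n + 1) // 2  # start in the middle state
        while 0 < s < n + 1:
            s2 = s + rng.choice((-1, 1))       # fixed policy: step left or right
            r = 1.0 if s2 == n + 1 else 0.0
            # TD(0): move V[s] toward the bootstrapped target r + gamma * V[s2].
            V[s] += alpha * (r + gamma * V[s2] - V[s])
            s = s2
    return V[1:n + 1]


values = td0_random_walk()
```

For this particular walk the true values are 1/6, 2/6, ..., 5/6, so the learned estimates can be checked against them; with a constant step size they fluctuate around those values rather than settling exactly.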
The above methods not only all converge to the correct estimates for a fixed policy, but can also be used to find the optimal policy. This is usually done by following a policy π that's somehow derived from the current value estimates, for example by choosing the action with the highest evaluation most of the time, while still occasionally taking random actions in order to explore the space. Proofs for convergence to the optimal policy also exist for the algorithms mentioned above, under certain conditions. However, all those proofs only demonstrate asymptotic convergence and little is known theoretically about the behaviour of RL algorithms in the small-sample case, apart from within very restricted settings.
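One way to picture this is tabular Q-learning with an epsilon-greedy behaviour policy. The sketch below is illustrative only: the deterministic five-state corridor, the reward of 1 at the right-hand end, and the values of `alpha`, `gamma` and `epsilon` are all assumptions chosen for the example.

```python
import random

N_STATES = 5        # hypothetical corridor: states 0..4, state 4 is the goal
ACTIONS = (-1, +1)  # action index 0 = move left, 1 = move right


def step(s, a):
    """Deterministic move, clipped at the walls; reward 1 on reaching the goal."""
    s2 = min(max(s + a, 0), N_STATES - 1)
    done = s2 == N_STATES - 1
    return s2, (1.0 if done else 0.0), done


def q_learning(episodes=500, alpha=0.5, gamma=0.9, epsilon=0.2, seed=0):
    rng = random.Random(seed)
    Q = [[0.0, 0.0] for _ in range(N_STATES)]
    for _ in range(episodes):
        s, done = 0, False
        while not done:
            # Mostly take the highest-valued action, but sometimes act randomly
            # (also when the two estimates tie) in order to explore.
            if rng.random() < epsilon or Q[s][0] == Q[s][1]:
                a = rng.randrange(2)
            else:
                a = 0 if Q[s][0] > Q[s][1] else 1
            s2, r, done = step(s, ACTIONS[a])
            # Q-learning target uses the best action in the next state.
            target = r if done else r + gamma * max(Q[s2])
            Q[s][a] += alpha * (target - Q[s][a])
            s = s2
    return Q


Q = q_learning()
greedy = ["right" if q[1] > q[0] else "left" for q in Q[:-1]]
```

Here the policy being followed is derived from the current value estimates, and the estimates are in turn improved by the experience that policy generates.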
An alternative method to find the optimal policy is to search directly in policy space. Policy space methods define the policy as a parameterised function with adjustable parameters. Commonly, a gradient method is employed to adjust the parameters. However, the application of gradient methods isn't trivial, since no gradient information is assumed. Rather, the gradient itself must be estimated from noisy samples of the return. Since this greatly increases the computational cost, it can be advantageous to use a more powerful gradient method than steepest gradient descent. Policy space gradient methods have received a lot of attention in the last 5 years and have now reached a relatively mature stage, but they remain an active field. There are many other approaches, such as simulated annealing, that can be taken to explore the policy space. Other direct optimization techniques, such as evolutionary computation, are used in evolutionary robotics.
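A small sketch may help show what estimating the gradient from noisy samples of the return can look like. It uses a likelihood-ratio (REINFORCE-style) estimator with a softmax policy on a two-armed bandit, which is one possible choice among many; the parameter vector `theta`, the running-average baseline, the learning rate, and the reward means are assumptions for illustration.

```python
import math
import random


def softmax(theta):
    """Turn policy parameters into action probabilities."""
    m = max(theta)
    exps = [math.exp(t - m) for t in theta]
    z = sum(exps)
    return [e / z for e in exps]


def reinforce_bandit(true_means=(0.0, 1.0), lr=0.1, steps=3000, seed=0):
    """Stochastic gradient ascent on expected return for a softmax policy."""
    rng = random.Random(seed)
    theta = [0.0] * len(true_means)  # hypothetical policy parameters
    baseline = 0.0                   # running mean of returns, reduces variance
    for t in range(1, steps + 1):
        probs = softmax(theta)
        a = rng.choices(range(len(theta)), weights=probs)[0]
        ret = rng.gauss(true_means[a], 1.0)  # noisy sample of the return
        baseline += (ret - baseline) / t
        for i in range(len(theta)):
            # Gradient of log pi(a) with respect to theta[i] for a softmax policy.
            grad_log = (1.0 if i == a else 0.0) - probs[i]
            theta[i] += lr * (ret - baseline) * grad_log
    return softmax(theta)


final_probs = reinforce_bandit()
```

No value table is kept here: only the policy parameters are adjusted, and each update relies on a single noisy return, which is why variance reduction (the baseline in this sketch) matters in practice.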
Current research
Current research topics include alternative representations (such as the Predictive State Representation approach), gradient descent in policy space, small-sample convergence results, algorithms and convergence results for partially observable MDPs, and modular and hierarchical reinforcement learning.
Recently, reinforcement learning has been used in psychology to explain human learning and performance. In particular, it has been used in cognitive models that simulate human performance during problem solving and/or skill acquisition (for example, Gray, Sims, Fu, & Schoelles, 2006; Fu & Anderson, 2006). It has also been used to propose a model of the human error-processing system (Holroyd & Coles, 2002). Multiagent or distributed reinforcement learning is also a topic of interest in current research in this field.